Skip to content

QVAC-19778 fix[api]: finish_reason=length + unified token accounting across chat-category routes#2477

Merged
lauripiisang merged 5 commits into
mainfrom
feat/QVAC-19778-finish-reason-tokens
Jun 9, 2026
Merged

QVAC-19778 fix[api]: finish_reason=length + unified token accounting across chat-category routes#2477
lauripiisang merged 5 commits into
mainfrom
feat/QVAC-19778-finish-reason-tokens

Conversation

@lauripiisang

@lauripiisang lauripiisang commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

🎯 What problem does this PR solve?

  • finish_reason was hardcoded to "stop" on all chat-category routes, even when generation was cut off by max_tokens (should be "length").
  • completion_tokens was counted three different ways: stats helper in /v1/chat/completions, whitespace split in /v1/completions (blocking), SSE-event counter in /v1/completions (streaming). The Responses route had its own inline fallback. All four could diverge for the same generation.
  • /v1/videos* endpoints were missing from both the in-repo and marketing-site docs.

📝 How does it solve it?

  • drainCompletion(result, onToken?) in adapters/openai/completion-result.ts consumes the SDK result.events stream once and returns { text, toolCalls, stats, stopReason, completionTokens, finishReason }.
  • Finish-reason precedence: tool_calls > length > stop. All three chat routes (chat, completions, responses) now use it.
  • Responses API: stopReason === "length" maps to status: "incomplete" + incomplete_details.reason: "max_output_tokens"; the streaming path emits response.incomplete instead of response.completed.
  • Docs: added /v1/videos* to packages/cli/docs/serve-openai.md and docs/website/content/docs/cli/http-server/index.mdx.

The two other QVAC-19778 follow-up items needed no code change — verified against the current tree:

🧪 How was it tested?

  • New test/completion-result.test.ts: 20 cases covering eos, length, stopSequence, tool_calls finish-reason branches; stats.generatedTokens vs whitespace-fallback token counting; streaming onToken callback ordering.
  • Extended test/responses.test.ts and test/responses-streaming.test.ts: assert status: "incomplete" + incomplete_details on max_tokens truncation.
  • New e2e cases in test/e2e.bats against a real loaded LLM: max_tokens: 1 with a long-output prompt, asserting finish_reason == "length" and usage.completion_tokens == 1 for both blocking and streaming paths.
  • All 374 CLI unit tests pass (bun run test:unit). Lint and typecheck clean (bun run lint).
  • All 134 e2e tests pass (npm run test:e2e, Qwen3-600M). Six max_tokens values bumped from 8–24 → 512 in e2e.bats for tests that asserted finish_reason: "stop": these were too tight for Qwen3's chain-of-thought <think> preamble after the fix correctly started returning "length" on budget exhaustion.

🔌 API Changes

// finish_reason now reflects truncation
{ finish_reason: "length" }  // when max_tokens cuts generation short (was always "stop")

// Responses API — length truncation (blocking)
{ status: "incomplete", incomplete_details: { reason: "max_output_tokens" } }
// Responses API — length truncation (streaming)
{ type: "response.incomplete" }  // was "response.completed"

// completion_tokens: now consistently uses stats.generatedTokens (with whitespace-split fallback)
// across /v1/chat/completions, /v1/completions, and /v1/responses

🔗 Dependencies

Depends on #2484 — SDK must emit stopReason: "length" before the two known-failing e2e tests (94, 95) pass.

@lauripiisang lauripiisang requested review from a team as code owners June 8, 2026 09:02
@lauripiisang lauripiisang marked this pull request as draft June 8, 2026 09:06
@lauripiisang

This comment was marked as outdated.

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

@lauripiisang lauripiisang force-pushed the feat/QVAC-19778-finish-reason-tokens branch from 2ac8659 to e71da37 Compare June 8, 2026 09:12
@lauripiisang lauripiisang changed the title QVAC-19778 feat[api]: audio encoding, finish_reason=length, and token accounting for chat-category routes QVAC-19778 fix[api]: finish_reason=length + unified token accounting across chat-category routes Jun 8, 2026
@lauripiisang lauripiisang force-pushed the feat/QVAC-19778-finish-reason-tokens branch from e71da37 to f4d049c Compare June 8, 2026 09:28
Introduce drainCompletion (completion-result.ts) to consume the SDK
event stream once: derives text, stats, stopReason, and finish_reason
(tool_calls > length > stop). All three chat-category routes use it;
per-route token-counting variants removed.

Responses API: length truncation maps to status "incomplete" +
incomplete_details.reason "max_output_tokens".

Two e2e tests for finish_reason=length are known-failing pending
SDK PR #2484 (stopReason="length" not yet emitted by the plugin).

Six e2e max_tokens bumped 8-24 → 512 for Qwen3-600M <think> compat.

Docs: /v1/videos* added to serve-openai.md and http-server mdx.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lauripiisang lauripiisang force-pushed the feat/QVAC-19778-finish-reason-tokens branch from 34b780a to 5a79aa6 Compare June 8, 2026 16:25
@lauripiisang lauripiisang marked this pull request as ready for review June 8, 2026 16:51
lauripiisang and others added 2 commits June 8, 2026 23:25
messages[].content now accepts OpenAI array format (text/image_url/
input_audio/file parts). Non-text parts are silently dropped; text
parts are concatenated. Fixes 400 errors from Cline and Open WebUI.

audio transcription/translation temperature was z.string() — OpenAI
spec requires number. Fixes 400 for any client sending temperature:0.0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Browsers (and any client using b64 MP3) send audio that whisper.cpp
cannot decode when piped as a raw buffer. By writing the buffer to a
temp file with the correct extension and passing the path string to
the SDK, format detection runs on-disk rather than on the buffer.

Both /v1/audio/transcriptions and /v1/audio/translations use the new
path. The temp file is deleted in a finally block regardless of
outcome. transcribeOverride type widened to string | Buffer to match
the SDK.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread packages/cli/src/serve/adapters/openai/completion-result.ts Outdated
Comment thread packages/cli/src/serve/adapters/openai/completion-result.ts Outdated
Comment thread packages/cli/src/serve/schemas/audio.ts Outdated
…oerce

drainCompletion: throw HttpError(502, inference_failed) when completionDone
carries stopReason=error (the events stream yields this normally, it does
not throw). Throw InferenceCancelledError on cancelled by awaiting
result.final after the loop, which is already rejected at that point.
Previously both cases returned partial 200. Fix JSDoc comment that wrongly
claimed result.events throws on errorDone.

audio schemas: z.coerce.number() for temperature on transcriptions and
translations. multipartToBody stringifies all non-file fields, so z.number()
without coerce rejected valid clients with 400, breaking the documented
"temperature is ignored" behavior.

Tests: add fakeRun cases for errorDone (expects HttpError 502) and
cancelledDone (expects InferenceCancelledError via rejecting final promise).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lauripiisang

Copy link
Copy Markdown
Contributor Author

/review

@lauripiisang lauripiisang merged commit cf85d37 into main Jun 9, 2026
10 of 11 checks passed
@lauripiisang lauripiisang deleted the feat/QVAC-19778-finish-reason-tokens branch June 9, 2026 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants